Bivariate Plots Section

The above correlation matrix helps in identifying some of the interesting trends in the data.

We have a high correlation between views and likes, and between dislikes and comment count.

Before drawing scatter plots to visualize these correlations, we have to normalize the ranges of the four features mentioned above.


After normalizing each feature to the range [0, 1], we get the following output:

##          views       likes     dislikes comment_count
## 1: 0.117422509 0.122986452 1.236070e-02  0.0213883870
## 2: 0.015345060 0.072727590 5.189260e-03  0.0304550596
## 3: 0.007477869 0.002425697 3.487775e-04  0.0002379588
## 4: 0.006732219 0.015137654 4.240274e-04  0.0070895577
## 5: 0.005249547 0.003901351 7.608605e-04  0.0010671426
## 6: 0.004773304 0.004023864 8.719437e-05  0.0010825658
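
The normalization step can be sketched as a min-max rescaling. This is only an illustration in base R: the helper name `min_max` and the toy values are assumptions, not the code or data behind the output above.

```r
# Min-max scaling: maps the smallest value to 0 and the largest to 1.
# `min_max` is a hypothetical helper written for this sketch.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy stand-in for the dataset (made-up values, not the real data).
toy <- data.frame(
  views         = c(559, 95070, 331600, 1025000),
  likes         = c(0, 1600, 7726, 25880),
  dislikes      = c(0, 79, 302, 1058),
  comment_count = c(0, 238, 888, 2914)
)

features <- c("views", "likes", "dislikes", "comment_count")
toy_norm <- as.data.frame(lapply(toy[features], min_max))
print(toy_norm)
```

Note that a constant column would make the denominator zero, so a real pipeline would need to guard against that case.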

Using suitable limits for the X and Y axes:

We can clearly observe the correlations that we found previously using the correlation matrix.

The following plots show the variation in likes, dislikes, comment count and days on trending.

The weaker but still notable correlations, such as those between views and dislikes and between views and comment count, are visible here.

As the number of views increases, the other features also increase, which agrees with the statistics we calculated in the univariate plots section.

Moreover, only a small number of videos trend for 10 days or more.

Before we compare how the different features vary across the different categories, we take a look at the top 25 trending channels.

ESPN, Vox and Netflix are the top 3 channels (in that order) in terms of number of trending videos.
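
A ranking of this kind can be sketched by counting videos per channel. The column names `video_id` and `channel_title`, the channel names and the choice to count each video once (rather than once per daily record) are all assumptions made for this illustration.

```r
# Toy daily records (hypothetical channels and ids, not the real data).
toy <- data.frame(
  video_id      = c("a", "a", "b", "c", "d", "e", "e", "f"),
  channel_title = c("chan_x", "chan_x", "chan_x", "chan_y",
                    "chan_y", "chan_z", "chan_z", "chan_x"),
  stringsAsFactors = FALSE
)

# Keep one row per video so repeated daily records are counted once.
unique_videos <- unique(toy[, c("video_id", "channel_title")])

# Count videos per channel and keep the (up to) 25 largest counts.
counts <- sort(table(unique_videos$channel_title), decreasing = TRUE)
top_channels <- head(counts, 25)
print(top_channels)
```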

We will look at how these trending channels are distributed across the categories in the multivariate plots section.

## [1] "Summary for views feature:"
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##       559     95070    331600   1278000   1025000 149400000
## [1] "Summary for likes feature:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1600    7726   39720   25880 3094000
## [1] "Summary for dislikes feature:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      79     302    2598    1058 1674000
## [1] "Summary for comment count feature:"
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##       0.0     238.0     888.5    4976.0    2914.0 1362000.0
## [1] "Summary for days on trending feature:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   5.000   4.958   7.000  14.000
## [1] "Summary for number of trending videos feature:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00   24.00   37.58   62.00  114.00

The Entertainment and Music categories have very high values across all the features.

Nonprofits & Activism has the lowest median and 1st quartile values for nearly all the features.

Shows has a very small difference between the 3rd and 1st quartiles (the interquartile range) for all the features. It also has the highest median, 1st quartile and 3rd quartile values for days on trending.

Overall, it seems that Shows is one category that consistently appears on trending.
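
The per-category quartiles behind observations like these can be computed as in the sketch below. It uses base R on made-up numbers; only the column names `category_name` and `views` come from this report.

```r
# Toy data (made-up values) with two categories.
toy <- data.frame(
  category_name = rep(c("Music", "Shows"), each = 5),
  views         = c(100, 200, 300, 400, 500, 10, 20, 30, 40, 50)
)

# One column per category, one row per quartile.
q <- sapply(split(toy$views, toy$category_name),
            quantile, probs = c(0.25, 0.5, 0.75))
print(q)

# Interquartile range per category: 3rd quartile minus 1st quartile.
iqr <- q["75%", ] - q["25%", ]
print(iqr)
```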


Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The number of views and the number of likes are highly correlated, both in the entire dataset (0.83) and in the Top 100 Trending Videos (0.85).

The number of dislikes and the comment count are also highly correlated in the entire dataset (0.83), but the correlation drops drastically in the Top 100 Trending Videos (0.36).

How the features vary across categories depends strongly on the feature under investigation.

Nonprofits & Activism has the lowest median and 1st quartile values for nearly all the features.

Shows has a very small difference between the 3rd and 1st quartiles for all the features. It also has the highest median, 1st quartile and 3rd quartile values for days on trending.

In the Top 100 Trending Videos, Entertainment and Music still have high values across all the features, but the trend is a lot less pronounced, with other categories such as Film/Animation and People & Blogs coming close.

Surprisingly, Pets & Animals is the category with the highest median, 1st quartile and 3rd quartile values in the Top 100 Trending Videos.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

In the Top 100 Trending Videos subset of the dataset, the Pets & Animals category has the highest median, 1st quartile and 3rd quartile values.

What was the strongest relationship you found?

The strongest correlation was between views and likes: 0.83 in the complete dataset and 0.85 in the Top 100 Trending Videos.
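
Comparing a correlation over the full data with the same correlation over a top-100 subset can be sketched as follows. The data is simulated, and ranking the subset by views is an assumption; how the Top 100 Trending Videos were actually selected is not shown in this section.

```r
set.seed(1)

# Simulated stand-in for the dataset (not the real data).
n <- 500
views <- rexp(n, rate = 1e-6)
likes <- 0.03 * views + rnorm(n, sd = 5000)
toy <- data.frame(views = views, likes = likes)

# Pearson correlation (the default method of cor) over all rows.
r_all <- cor(toy$views, toy$likes)

# Assumed subset: the 100 rows with the most views.
top100 <- head(toy[order(-toy$views), ], 100)
r_top <- cor(top100$views, top100$likes)

cat(sprintf("all rows: %.2f, top 100: %.2f\n", r_all, r_top))
```

Restricting the range of one variable generally changes the correlation, which is one reason the two figures need not agree.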



Multivariate Plots Section

We have already seen the top trending channels. Here, we take a look at those channels and see how they are distributed across the different categories.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   82.00   88.00   98.00   98.48  108.00  114.00

Each of these channels posts videos in a single category only, as no channel has bars of different colours.
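
The same observation can be checked numerically by counting distinct categories per channel. The sketch below uses hypothetical channels and an assumed `channel_title` column; it is not the code used to produce the plot.

```r
# Toy records (hypothetical channels, not the real data).
toy <- data.frame(
  channel_title = c("chan_x", "chan_x", "chan_y", "chan_y", "chan_z"),
  category_name = c("Sports", "Sports", "Music", "Music", "Entertainment"),
  stringsAsFactors = FALSE
)

# Number of distinct categories each channel posts in.
n_categories <- tapply(toy$category_name, toy$channel_title,
                       function(x) length(unique(x)))
print(n_categories)

# TRUE when every channel sticks to exactly one category.
single_category <- all(n_categories == 1)
print(single_category)
```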

Furthermore, as we saw in the bivariate plots section, the most common categories are Entertainment and Sports.

ESPN has the highest number of trending videos (114). Both the mean and the median number of trending videos are about 98.

In the bivariate plots section, we observed how the various features vary across different categories of videos. Now in this section, we can explore how these features vary across the Top 25 trending channels and see the classifications and conclusions of the bivariate plots section in action.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   297200  2390000  2941000  3874000  4101000 13030000

ESPN, NBA, Netflix and Vox are the top channels in terms of number of videos on trending. However, NFL, WIRED and WWE are well ahead of them in terms of number of views.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3377   28130   50840   78150  102800  436500

Again, ESPN, NBA, Netflix and Vox are the top channels in terms of number of videos on trending. However, First We Feast, NFL, The Tonight Show Starring Jimmy Fallon, WIRED and WWE are higher in terms of number of likes.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     217    1761    3218   12300    7591  132400

Here, even NFL comes close to the four channels that topped the previous plots. We see a massive spike in the number of dislikes for Washington Post (132400), a News & Politics channel; this is more than an order of magnitude higher than the median (3218). ESPN and NFL are, again, still in the top 3.

(The mean is not a good basis for comparison in this plot, as a very large outlier skews it.)
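
The effect of a single outlier on the mean can be illustrated with a small sketch. The five values below reuse the minimum, quartiles and maximum from the dislikes summary above as a toy sample; they are not the actual per-channel data.

```r
# Toy sample built from the five summary values (not the real channel data).
dislikes <- c(217, 1761, 3218, 7591, 132400)

# With the outlier: the mean is pulled far above the median.
mean_with   <- mean(dislikes)
median_with <- median(dislikes)

# Without the outlier: the mean and median are much closer together.
trimmed        <- dislikes[dislikes < max(dislikes)]
mean_without   <- mean(trimmed)
median_without <- median(trimmed)

print(c(mean_with = mean_with, median_with = median_with))
print(c(mean_without = mean_without, median_without = median_without))
```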

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     852    2771    7572   11560   18850   35910

Again, we see large spikes for CNN, WIRED and Washington Post, all of which are News & Politics channels.

(Here the maximum value, 35910, is nearly five times the median value, 7572.)

This continues the trend of people expressing their wide range of opinions not only through dislikes but also through comments.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    5.00    4.64    6.00    8.00

Although ESPN, NBA, Netflix and Vox continue to dominate in terms of number of videos on trending, only the NBA channel's videos trend for longer than the average.

Also, channels like ABC News, Bon Appetit, Great Big Story and WIRED all produce videos that trend for longer than those of the channels mentioned above.



Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Looking at the top trending channels in terms of the number of videos trending, the same four channels (ESPN, NBA, Netflix and Vox) showed consistent domination across all the features except number of days on trending.

News & Politics channels in general had very large values for features like dislikes, comment count and days on trending, indicating that people hold strong opinions on the content posted by these channels.

Were there any interesting or surprising interactions between features?

Features like views and likes were generally higher for channels posting videos in the Entertainment and Music categories.

Features like dislikes, comment count and days on trending were generally higher for News & Politics channels, in the case of dislikes by more than an order of magnitude.



Final Plots and Summary

Plot One

Description One

We visualize the correlation between views and likes on the left and that between dislikes and comment count on the right. All four features are normalized, since the number of views is numerically always the largest of the features in the dataset.

One important trend we can observe from these correlations is that a video with a high number of views also tends to have a high number of likes, which suggests (but does not prove) that the chance of a video being popular due to bad publicity is low.

From the second plot, we can see that a video with a high number of dislikes also tends to have a high comment count, which suggests that the video may be controversial and that people are therefore more likely to leave a comment.

Of course, these claims are based on correlations alone. A hypothesis test can show whether a correlation is statistically significant, but it cannot by itself establish causation.
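
A significance test for a correlation could look like the sketch below, which uses `cor.test` from base R's stats package on simulated data. It only checks whether the correlation differs from zero; it says nothing about cause and effect, and it was not part of the original analysis.

```r
set.seed(42)

# Simulated stand-in for two correlated features (not the real data).
dislikes      <- rexp(200, rate = 1 / 2500)
comment_count <- 2 * dislikes + rnorm(200, sd = 2000)

# Pearson correlation test; the null hypothesis is zero correlation.
result <- cor.test(dislikes, comment_count, method = "pearson")

print(result$estimate)  # sample correlation coefficient
print(result$p.value)   # a small p-value is evidence against zero correlation
print(result$conf.int)  # 95% confidence interval for the correlation
```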

Plot Two

## [1] "IQR of Shows for views:"
## yt_trending$category_name == "Shows": FALSE
## [1] 930800.2
## -------------------------------------------------------- 
## yt_trending$category_name == "Shows": TRUE
## [1] 53558
## [1] "IQR of Shows for likes:"
## yt_trending$category_name == "Shows": FALSE
## [1] 24284.75
## -------------------------------------------------------- 
## yt_trending$category_name == "Shows": TRUE
## [1] 1816.5
## [1] "IQR of Shows for dislikes:"
## yt_trending$category_name == "Shows": FALSE
## [1] 979
## -------------------------------------------------------- 
## yt_trending$category_name == "Shows": TRUE
## [1] 67
## [1] "IQR of Shows for comment count:"
## yt_trending$category_name == "Shows": FALSE
## [1] 2676.75
## -------------------------------------------------------- 
## yt_trending$category_name == "Shows": TRUE
## [1] 867
## [1] "IQR of Shows for days on trending:"
## yt_trending$category_name == "Shows": FALSE
## [1] 4
## -------------------------------------------------------- 
## yt_trending$category_name == "Shows": TRUE
## [1] 3

Description Two

Focusing specifically on the Shows category, we can see from the plots that, across all the features, it has the smallest IQR of all the categories, yet its videos trend for the longest time.
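
Output of the shape printed under Plot Two is what base R's `by` produces when a logical grouping is passed as the index. The sketch below reproduces that shape on made-up numbers; `yt_toy` is a hypothetical stand-in for the dataset, and the exact call used in the report is an assumption.

```r
# Toy stand-in for the dataset (made-up values).
yt_toy <- data.frame(
  category_name = c(rep("Shows", 4), rep("Music", 4)),
  views         = c(100, 110, 120, 130, 10, 500, 2000, 90000),
  stringsAsFactors = FALSE
)

# IQR of views for rows that are Shows (TRUE) and all other rows (FALSE).
iqr_by_group <- by(yt_toy$views, yt_toy$category_name == "Shows", IQR)
print(iqr_by_group)
```

Note that this compares Shows against all other categories pooled together, not against each category individually.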

Plot Three

Description Three

Between the two plots, we can see that videos posted by Washington Post have both the highest number of dislikes and one of the highest numbers of comments.

Looking at other channels in the above plots with notable values, ESPN, NFL, CNN and WIRED have some of the highest values for number of comments.

An interesting point to note is that all of these channels post videos exclusively in either the News & Politics or the Sports category. These are categories whose audiences hold differing viewpoints and opinions, so there is a high possibility of disagreement, and this is clearly visible in the distribution of the data in the above plots.



Reflection

YouTube Trending Statistics is a daily record of the top trending videos on the video-sharing platform YouTube. The data used here is from the USA. It contains 23,362 video records (4,712 unique videos) across 13 features.

I began the analysis by preprocessing the data: I removed some of the features and converted the dataset into a data.table for quick access, merging and so on. Some features, such as days_on_trending and number_of_trending_videos, were derived from the data present in the dataset and added to it.
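
One plausible way to derive a feature like days_on_trending is to count the distinct trending dates recorded for each video. The sketch below does this in base R; the column names `video_id` and `trending_date`, and the derivation itself, are assumptions rather than the code actually used.

```r
# Toy daily records (hypothetical ids and dates, not the real data).
toy <- data.frame(
  video_id      = c("a", "a", "a", "b", "c", "c"),
  trending_date = c("d1", "d2", "d3", "d1", "d1", "d2"),
  stringsAsFactors = FALSE
)

# Distinct trending dates per video.
days <- tapply(toy$trending_date, toy$video_id,
               function(d) length(unique(d)))

# Attach the derived feature to every record of the video.
toy$days_on_trending <- as.integer(days[toy$video_id])
print(toy)
```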

The analysis was carried out both over the entire dataset and over the Top 100 Trending Videos (a subset of the entire dataset). In both cases views and likes were highly correlated, whereas dislikes and comment count were highly correlated only over the entire dataset and not in the Top 100 Trending Videos.

Entertainment and Music categories dominate the Top 100 Trending Videos in terms of the values of various features like views, likes, comment count and days on trending. Also, videos of the category Shows trend for the longest time.

Further, looking at the top trending channels, positively indicative features like views and likes were high for the channels posting Entertainment and Music videos, while features indicating controversy were high for channels posting News & Politics videos.

This trend is easily observed online. Politics tends to generate a lot of controversy as people with different opinions come into contact, whereas music is in general a category about which a majority of viewers hold similar opinions. This facet of online interactions is quite accurately captured by this analysis.

I also tried to include a feature measuring the number of days a video took from its upload date to start trending, but ultimately could not because I was unable to preprocess the date format given in the dataset.
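
Such a feature could be derived along the lines of the sketch below. The date formats are assumptions, since the dataset's actual formats are not shown in this report: trending dates are taken to look like `17.14.11` (two-digit year, day, month) and upload times like `2017-11-13T17:13:01.000Z`. The variable names are hypothetical too.

```r
# Example strings in the ASSUMED formats (not taken from the dataset).
trending_date <- c("17.14.11", "17.15.11")
publish_time  <- c("2017-11-13T17:13:01.000Z", "2017-11-10T07:38:29.000Z")

# Two-digit year, then day, then month, separated by dots.
trend_day <- as.Date(trending_date, format = "%y.%d.%m")

# Keep only the date part (first 10 characters) of the timestamp.
publish_day <- as.Date(substr(publish_time, 1, 10), format = "%Y-%m-%d")

# Whole days between upload and the trending record.
days_to_trend <- as.integer(trend_day - publish_day)
print(days_to_trend)
```

Truncating the timestamp to its date ignores the time of day and time zone, so the result can be off by one day for uploads close to midnight.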

In addition to the analysis performed here, we could do a sentiment analysis using the title of the videos, video description and tags of the video. This can help in more accurately predicting the different features of the videos. This type of analysis can help understand how the perception of the audience varies over different types of content and consequently help in the development of future content.